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Abstract 

We propose novel methodology for testing equality of model parameters between 
two high-dimensional populations. The technique is very general and applicable to 
a wide range of models. The method is based on sample splitting: the data is split 
into two parts; on the first part we reduce the dimensionality of the model to a man- 
ageable size; on the second part we perform significance testing (p-value calculation) 
based on a restricted likelihood ratio statistic. Assuming that both populations arise 
from the same distribution, we show that the restricted likelihood ratio statistic is 
asymptotically distributed as a weighted sum of chi-squares with weights which can 
be efficiently estimated from the data. Arbitrary splitting of the data results in a 
"p-value lottery". In order to get reproducible results we repeat the splitting proce- 
dure various times and aggregate the resulting p-values. This multi-split approach 
provides improved p-values. We illustrate the use of our general two-sample approach 
in two-sample comparison of high-dimensional regression models ( "differential regres- 
sion") and graphical models ("differential network"). In both cases we show results 
on simulated data as well as real data from recent, high-throughput cancer studies. 

Keywords High-dimensional two-sample testing; data splitting; £i-regularization; 
Non-nested hypotheses; High-dimensional regression; Gaussian graphical models; Dif- 
ferential regression; Differential network 
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1 Introduction and Motivation 



We consider the general two-sample testing problem where the goal is to determine whether 
or not two independent populations U and V, parameterized by 4> u and (j) v respectively, 
with <p u ,(j) v G $ C MP, arise from the same distribution. The hypothesis testing problem 
of interest is 

H : <p u = <t>v against H A : (fi u / <p v . (1.1) 

In this paper the focus is on the high-dimensional setting where the number of avail- 
able samples per population (n u and n„) is small compared to the dimensionality of the 
parameter space p. 

In a setup where n u and n v is much larger than p the ordinary likelihood-ratio test offers a 
very general solution to problem (jl.ip . It is well-known that under Ho the likelihood ratio 
statistic is asymptotically Xp distributed which then allows computation of confidence 
intervals and p- values. However, in settings where p is large compared to sample sizes, the 
likelihood-ratio statistic is ill-behaved and obviously, asymptotic approximations cannot 
be directly used to test statistical significance. 

Our approach for solving the general hig h-dimensional two-sample prob lem (jl.ip is moti- 
vated by the screen and clean procedure ( Wasserman and Roeder . 20091 ) originally devel- 



oped for improved variable selection in the high-dimensional regression model. The idea is 
to split the data from both populations into two parts; perform dimensionality reduction 
on one part and pursue significance testing on the other part of the data. In more detail, 
on the first split we select three subsets of $ by screening for the most relevant parameters 
of population U and V individually but also by selecting the important parameters of the 
pooled data from both populations. In this way we obtain an individual model with dif- 
ferent parameter spaces for U and V and a joint model which describes both populations 
by a common parameter space. On the second split of the data, we then can compute the 
likelihood ratio statistic between these two models, the so-called restricted likelihood ratio 
statistic. A crucial observation is that the two models are non-nested and therefore non- 
standard tools are needed to obtain the asymptotic null distrib ution. We app ly the theory 
on model selection and non- nested hypotheses developed by IVuongl ( 19891 ) to the two- 
sample scenario and show that the null distribution asymptotically approaches a weighted 
sum of independent chi-squared random variables with weights which can be efficiently 
estimated from the second split of the data. Importantly, the weighted sum of chi-squared 
approximation is invoked only in the second split, where the model dimensionalty has been 
reduced. 

As indicated above our approach involves a screening or model selection step prior to 
significance testing. Our method requires that the parameter set selected by a specific 
screening procedure contains the true model parameter [screening property) and that the 
selected set is small compared to the sample size (sparsity property). The first property is 
needed for deriving the asymptotic null distribution, whereas the latter property justifies 
asymptotic approximation. The two conditions which we impose here are much weaker 
than requiring consist ency in variable selection. We propo se to use ^-penalized maximum 
likelihood estimation ( Tibshirani , 1996 ; Fan and Lil . l200ll ) with regularization parameter 
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chosen by cross-validation. This sets automatically non-relevant parameter components to 
zero and allows for efficient screening even in scenarios with very large p. l\ penalization 
leads to sparse estimates and there is theoretical evidence that the screening property 
holds under mild additional assumptions. 

In current applied statistics, a wide range of applications are faced with high-dimensional 
data. Par ameter estimation in the "large p, small n" setting has been extensively studied 
in theory (jBiihlmann and van de Geerl . l201lh and also applied with success in many areas. 
Since interpretation of parameters is crucial in applied science there is a clear need to assess 
uncertainty and statistical significance in such settings. Howe yer, significance tes t ing in 
hig h-dimensional sett ings has only recently attracted attention. lMemsha.se, et all & 
and lBiihlmannl (|2012l ) consider testing in the h igh-dimensional linea r mode l. In the context 
of hig h -dimensional two-s ample comparison Bai and Saranadasa ( 19961 ). Chen and Qinl 
( 20ld ). Lopes et al. ( 2012 ) a nd ot hers treat testing for differences in the population means. 
And recently, Li and Chen ( 20121 ) developed a two-sample test for covariance matrices of 
two high-dimensional populations. 

The approach proposed in this paper tackles the high-dimensional two-sample testing in 
a very general setting. In our methodology <j) u and <f> v can parameterize any model of 
interest. In the empirical examples we show below we focus on two specific applications 
of our approach: (i) High-dimensional regression, where two populations may differ with 
respect to regression models and (ii) Graphical modelling, where the two populations 
may differ with respect to conditional independence structure. In analogy to the term 
"differential expression" as widely-used for testing means in gene expression studies, we 
call these "differential regression" and "differential network" respectively. Both high- 
dimensional regression and graphical models are now widely used in biological applications, 
and very often scientific interest focuses on potential differences between populations (such 
as disease types, cell types, environmental conditions etc.). However, to date in the high- 
dimensional setting, two sample testing concerning such models has not been well studied. 
The methodology we propose offers a way to directly test hypotheses concerning differences 
in molecular influences or biological network structure using high-throughput data. 

The organization of the paper is as follows: Section [2] introduces the setup and the method- 
ology: Section 12.11 explains variable screening using ^i-regularization; Section 12.21 intro- 
duces the restricted likelihood-ratio statistic; Sections 12.31 and 12.41 derives its asymptotic 
null distribution and Section 12.51 shows how to compute p- values using the single- and the 
multi-split algorithm. In Section[3]we present and give details on the two specific examples 
of our approach that we outlined above {differential regression and differential network). 
Finally, in Section HI we evaluate our methodology on simulated and real data from two 
recent high-throughput studies in cancer biology. 



2 Data Splitting, Screening and Non-Nested Hypothesis 



Consider a conditional model class given by densities 

d(y\x;(p), y G M k , x G R l , z = (y,x) and (f> G $ C 



(2.2) 
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Note that x can be empty (i.e., I = 0). Let population U = (Y U ,X U ) and V = (Y V ,X V ), 
where X u , X v are both generated from the same distribution and conditional on the X's, 
Y u and Y v are generated from d(-\x;<fi u ) and d(-\x\ tfi v ) respectively. 

The goal is to solve the general two-sample testing problem (jl.ip given data-matrices 
U = (Y U ,X U ) e R««x(fc+l) anc i v = (Y t ,,X tI ) e R»« *(*+'), representing i.i.d. random 
samples of f7 and V. We consider the high-dimensional case with p 3> min{n„,n u } and 
assume that the data generating parameters 4> u and ^„ are sparse, meaning that many 
of their components are equal to zero. If k = 1, 1 = 0, we have a classical (univariate) 
two-sample set-up. If k = 1 and is the univariate Normal distribution with 

mean /3 T x and noise variance a 2 , (f> = (/3,<r 2 ), then U and V follow linear regression 
models. If / = and d(y; (f>) is the multivariate Normal distribution with <f> representing 
the inverse covariance matrix, then U and V follow a Gaussian graphical model and (II. ip 
asks whether or not two graphical models (in short: networks) are significantly different. 
We will treat these two examples in detail in Section [3] and [4l We refer to the regression 
case as differential regression and to the Gaussian graphical model case as differential 
network. 

Our method ology for solving (11.11) is based on sample splitting and has its inspiration from 



the work by Wasserman and Roeder ( 20091 ). We randomly divide data from population U 



into two parts U; n and U ou t of equal size and proceed in the same way with population V 
which yields Vj n and V out . In a first step the dimensionality of the parameter space $ is 
reduced by filtering out (potentially many) redundant components. We do this by applying 
a screening procedure X on Ui n and Vi n separately, but also on pooled data (Ui n , Vj n ). 
In this way we obtain models with lower dimensionality than the full dimension p. The 
first model describes populations U and V individually using different reduced-parameter 
spaces, whereas the second model explains both populations jointly with a single reduced- 
parameter space. In a second step, we then evaluate the restricted likelihood-ratio statistic 
on the held-out data U ou t and V out and perform significance testing. 



2.1 Screening and ^-regularization 

Consider i.i.d. data Z = (Y,X) £ ]R nx ( fc +0 with Y\X = x distributed according to 
d(-\x; 4> ) and </> D G $ C W. We have p 3> n and <j) is supposed to be sparse. 

A screening procedure X selects, based on data Z, a set A of active parameter components 
by a map 

X : Z A C {l,...,p}. 
The map X then defines the active parameter space by 

$Z(Z) = • • • , <f> P ) ■ <t>j = for a11 = X ( Z )}- 

There are two basic requirements on the screening procedure X. Firstly, the procedure 
should get rid of many non-relevant components in the sense that the set of active pa- 
rameter components X(Z) should be small compared to n. Secondly, the active parameter 
space $j(z) should contain the true parameter cj) . We refer to these requirements as the 
sparsity and screening property: 
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• Sparsity property: |X(Z)| is small compared to n. 

• Screening property: <p D € &z(Z)i if Z generated according to (j) . 

The screening property guarantees that the model {d(-\x;4>) : <p S $x(z)}> selected by X, 
is correctly specified in the se nse that it contain s the true density function d(-\x; 4> ). Ll- 
penalized likelihood methods ( Fan and Lil . l200l"l ) serve as our prime example of screening 
procedures. Consider estimators of the form 



j\ = argmax 



;Y|x) 



(2.3) 



where £(cf), Y|X) denotes the conditional log-likelihood and A is a non-negative regulariza- 
tion paramete r. In the context o f linear regression (j2.3j) coincides with the Lasso estimator 
introduced by iTibshiranil (119961). Other important exa mple of the form (12.31) include the 
elastic net ( Zou and Hastie . 20051 ). the grouped Lasso ( Yuan and Linl . I2006T ) , the graphi- 



cal L asso (iFriedman et al\ . 120081 ) or the Lasso for generalized linear models (IPark et al 
20071 ) . Estimators of the form (|2.3p have been shown to be extremely powerful: they are 
suitable for high-dimensional data and lead to sparse solutions. A screening procedure 
can be defined by setting 



X A (Z) = {j : ^ 0}. 



(2.4) 



The question is whether or not 2a (Z) satisfies both the sparsity and the screening prop- 
erty. In principle the answer depends on the choice of the tuning parameter A. Too 
strong regularization (too large A) leads to very sparse solutions, but with relevant pa- 
rameter components incorrectly set to zero, whereas too little regularization (too small 
A) results in too large active sets. Common practice is to choose A in a prediction op- 
timal manner, for example by optimizing a cross-validation score. This gives typically 
sparse solutions with a large r num ber of non-zero components than the true number 



( Meinshausen and Biihlmann . 20061 ). It seems to be reasonable in practice to assume 
that the sparsity and sc reening property hold for the procedure obtained via i\- 



regularization. In fact, iBiihlmannl (j2012l ) points out that the screening property is a 
very useful concept for £i-regularization which holds under much milder assumptions than 
consistency in variable selection. 

By applying the screening procedure I\ on Uj n and Vj n separately, we obtain active-sets 



/ u =X A (U in ), I v =l x (V in ). 



Further we obtain active-set 



4„=X A ((U in ,V in )) 

by applying X jointly on (Uj n , Vi n ). For the rest of this section we treat the active-sets 
I u , I v and I uv as fixed, i.e., not depending on the sample size. 

2.2 Restricted likelihood ratio statistic 



We define model Mj i nt with shared parameter space $>i uv for both population U and V 
jointly as 
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d(yu\x u ; 4>uv)d(y v \x v ; 4> uv ) : <p uv G j 
and model Mi n( j with individual parameter spaces 3>/ u and for each population as 



M ind = |d(z/u|a;u;^«)rf(2/t;|a;«;^«) : {(j>uAv) e $/ M x 

The log-likelihood functions with respect to models Mj i n t and M in d are given by 

nu Tly 

^nSiv^uv) = y~] log d(Y Uji |X M;i ; (j> uv ) + ^ log d(Y Vji |X,^ ; (f) uv ) 

i=l i=l 

Tin TId 



i=l t=l 



We will test hypothesis based on the restricted log-likelihood ratio test-statistic 
defined as 

LR W „ = 2{maxL^(^,^)-maxL^(0)} (2.5) 

= 2{L™ nv (4> u , 4> v ) - v£X (4> uv )}, 

where (4> u , <f> v ) and <j> uv denote the maximum likelihood estimators corresponding to models 
Mind and Mj oirit . 

It is crucial to not e that testin g based on (|2.5|) is non-trivial and involves non-nested 
model comparison (|Vuond . [l989h . The difficulty arises from the fact that model Mj i nt is 



not nested in Mj n d, or in other terms $j{Z! $>i u U &i v - Non-nestedness of these models is 
fundamental and is not only an artefact of the random nature of the screening procedure. 
If (p u 7^ <f) v , than X applied on two different populations mixed together will set different 
components of <j> to zero than X applied two both populations individually. This is a 
consequence of model misspecification and is also well known as Simpson's paradox in 
which an association present in two different groups can be lost or even reversed when the 
groups are combined. 

Under Ha, or if <j) u / (f) v , then we have: 

Proposition 2.1. Assume that the screening property holds and consider the sets I U ,I V 
and I uv as fixed. Then under Ha and regularity assumptions (Al)-(AS) (listed in Ap- 
pendix^): 

LR nu<nv =2{L^ nv ^ u J v )-LZ%^ uv )}n-^ K,n^oo). 



A proof is given in Appendix[A] Based on Proposition 12 . 1 1 we reject Ho if LR nui n v exceeds 
some critical value. The critical value is chosen to control the type-I error at some level 
of significance a. Alternatively, one may directly compute a p-value. In order to com- 
pute a critical value and/or determine a p-value we deduce in Section [2.31 the asymptotic 
distribution of LR^^ ^ under Hq. 
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2.3 Asymptotic Null Distribution 



The work of Vuon 1 (|l989h specifies the asymptotic distribution of the log-likelihood 



ra- 



tio statistic for comparing two competing models in a very general setting, in particular 
Vuong's t heory covers t he case where the two models are non-nested. We apply Theo- 
rem 3.3 of lVuonel <|l9SSh to the special case where the competing models are Mj ; n t and 



M ind . 

Let <f>* uv and (0*,<^>*) be pseudo-true values of model Mj ; nt , respectively M^: 
<t>* uv = arg min (EpE^ [log d(Y u \X u ; </>)]] +E[E^ [log d(Y v \X v ; 0)]]) , 
= argminE[E Jlogd(y u |Z u ;i/;)]], #j = arg minE[E „ [log d(Y v \X v ; £)]]. 

Define for a, b G {u, v, uv}, c G {tt, and sets A,Bc{l,...,p} 

d 

s A (y\x;(p) = ——logd(y\x;<p) and 
d<f>A 

B c AB {4>*aAt) = m<t>AsA{Y\X;cp* a )sB{Y\X;ct>t) T ]]. 
We further consider the matrices 

BmWKAI) = ^ { f u) Bv^yj , B Mioint (^ uv ) = Bl v (t* uv ) + Bl v (<j>* uv ) 

and 

BM ioiatMin Muvi<,<f>*v) = {Bt vI Mlv-,K),Bt vIv (r uv -,ti)) . 

The following theorem establishes the asymptotic distribution of LR nuirit) . 
Theorem 2.1 (Asymptotic null-distribution of restricted likelihood ratio-statistic). As- 
sume that the screening property holds and consider the sets I u , I v and I uv as fixed. If 
4>u = 4*v{ = '- <P) an d under regularity assumptions (A1)-(A6) (listed in Appendix\M) we 
have: 

LRn u ,n v ->" *r(S^), 

with r = \I U \ + \I V \ + \I UV \ and VPr("j v) denotes the distribution function of a weighted sum 
ofr independent chi-square distributions where the weights Vj, j = 1, . . . ,r are eigenvalues 
of the matrix 

W = [ Xr « +r - ^M fad M joint -^Mjojnt I ^.6) 



' UV 



A proof is given in Appendix O Set J = I uv C\I U C\ I v , I a = I a — J, h = h — J and 

Q c iJh {4>lAD = B c Ub {KAl) - Bl^lAm^^lAl^B^^At)- 

If we assume that the screening property holds then model Mind is correctly specified and 
the pseudo-true values {4>u->4>V) equal the true values (4> u ,(f> v ). Furthermore, if we have 
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4>u = <t>v(=' 4>) then the screening property guarantees that also Mj i nt is correctly specified 
and that d>*„, = 4>. We then write 



B c ab(^ ti) = B AB and Q] , 4>t) = Qfj. 

In Appendix [X] we prove the following proposition which characterizes the weights Uj 
defined as eigenvalues of the matrix W in Theorem 12, li 

Proposition 2.2 (Characterization of eigenvalues). The eigenvalues Vj, j = 1, . . . \I U \ + 
\Iv\ + \Iuv\t °f matrix W in Theorem 1 2. 1\ can be characterized as follows: 

If | ly, \ ~^~\Iv\ I J-uv I • 

• 2 \ J\ eigenvalues are 0. 

• \I U \ + \I V \ — \I UV \ eigenvalues are 1. 



The remaining eigenvalues equal ±*/l — Uj, where are eigenvalues of 

(Qi iQfQii +Qj jQfQii )(2Qi J" 1 . (2.7) 



If I luv I I lu I \Iy\ 

• 2 I J I eigenvalues are 0. 

• | I — (|-^u| + + \J\ eigenvalues are -1. 



The remaining eigenvalues equal — Uj, where fij are eigenvalues of 

(Qi j (2Qi T l Qi i )Qf {Qi i (2Qj y l Qf f )Qf 
(Qj / (2Q / )" 1 g 1 / )Qf (Qi i (2Q } )~ 1 Q } } )QJ 1 



(2.8) 



Expressions (|2.7p and (|2.8p of Proposition 12.21 have nice interpretations in terms of ana- 
lyzing the variances of the models M- m & and Mj ; nt . Consider the random variables 

1 n 

Z C =^Y] s(Y c4 \X C:i ; $) (c G {u, v}), 

which are asymptotically Normal distributed with mean zero. If res 1 ^ and resj b denote 
the residuals obtained from regressing Zf a and Zf b against Zj, then Qj j is the covariance 
between these residuals. It is easy to see that the matrix Q? ? Qj x Qf f equals 

Varfresr )— Varfres 1 ? Ires'? ), 

and therefore expression (|2.7p of Proposition 12.21 can be interpreted as the asymptotic 
variance of model Mj i nt not explained by model M^. Similar, we can see that (|2,8p 
describes the variance of model M- m & not explained by Mjoint- 

We further point out two special cases of Proposition 12.21 If Mj i nt is nested in M in( j, i.e., 
\Iuv\ = 0, then LR nujnu follows asymptotically a chi-squared distribution with degrees of 
freedom equal |/u| + |/u| — \I U v\- If I At I = \Iv\ = then LR nujriij is asymptotically distributed 
according to X 2 \j\ ~ X|/ U „ H jy 
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2.4 Estimation of the weights v in ^ r (-] v) 



In practice the weights v of the weighted sum of chi-squared null distribution have to be 
estimated from the data. In light of Theorem 12. 11 it would be straightforward to estimate 
the quantities Bj aI (cj)*, and plug them into expression (|2.6p to obtain an estimate W. 
Estimating v then involves computation of r = \I u \ + \Iv\ + \Iuv\ eigenvalues. Despite model 
reduction in the screening step, r can be a rather large number and can result in inefficient 
and inaccurate estimation. However, if both populations arise from the same distribution, 
then the overlap of the active-sets J = I uv n I u n I v is large compared to I u , I v and I uv . 
According to Proposition l2.21 the number of eigenvalues s = mm{\I uv \ — | J|, \I U \ + \I V \ — \ J\} 
which remain to be estimated is small compared to r. We therefore estimate matrix (|2.7p 
by 

{Q u i i{Q)Y l Q u n +Q) fiQ^r'Qh ){Q) +Q) T\ (2.9) 

N JuvJu J-u J-u-t-uv J-uv-t-v J-v J-vJ-uv' J-uv J-uv 



(2.10) 



and similarly matrix (12. 8p by 



(Qv (Q| +QH )" 1 Q| ^(Qjf)" 1 (Q| | (Q| +Q| )" 1 Q| f ] 

J-u J-uv J-uv 1 uv JuvJu J-u JuJuv 1 uv Juv 1 uvJv J V 

(Q v f f (Q u f +Q} / HQ?) -1 (Q u f +Q] ^Q) } )(Q V } ] 

JvJuv Juv Juv JuvJu Ju JvJuv Juv Juv JuvJv Jv 



Here, Q% • = B% . —B% T B C IT B C - and Br T denotes a consistent estimator of Br T (<b* 

'^Iah lah IaJ JJ J h IaIb lah^O, 

One possibility is to use the sample analogs: 



Blah, sample = ~ s h 0^c,i\^c,u 4>a) Sj t (Y C| t|Xc ; i; <j) b ) T 



t=l 



Another way is to plug-in estimators <p u and ip v into the expectation with respect to <p u 
and 4>v'- 

n c 

^U,plug-in = -T, E kl Sl "P\XcSJa) S Ib (Y\X C!i a b f]. (2.H) 
i=l 

Figure [1] shows for a linear regression example with I = 100 predictors the estimated 
weights v (upper panels show u obtained using directly Theorem 12. 1( lower panels show 
v obtained via Proposition I2.2p . Estimating the eigenvalues with help of Proposition 12.21 
works better than direct computation according to Theorem 12.11 In particular the true 
zero eigenvalues are poorly estimated with the direct approach. Figure [2] illustrates for the 
same example the quality of approximation of the ordinary and restricted likelihood-ratio 
with their asymptotic counterparts. The weighted sum of chi-squares approximates well 
already for small sample sizes, whereas the Xp comes close to the distribution function of 
the ordinary likelihood-ratio only for very large n's. 



2.5 P-Values and Multi-Splitting 

In this section we demonstrate how to obtain p-values for testing hypothesis (jl.ip using 
the methodology developed in Sections I2.lti2.4i The basic workflow is to split the data 
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n.100 



n-250 



n=500 



n-10000 




Figure 1: True weights (in red) and estimated weights (boxplots) for a regression ex- 
ample. Upper row shows estimates obtained by directly computing the eigenvalues 
of W, Theorem 12.11 Lower row shows estimates obtained via Proposition 12.21 and 
equations (|2.9p - (|2.10p . [Linear regression model with regression coefficients f3 £ M 100 , 
/3j = 1 (j = 1, . . . , 5) and zero else where, a 2 = 1. I u , I v and I uv obtained using sampled 
data of size (n :=)n u = n v = 100. Eigenvalues computed using sampled data of size 
( n :=)n u = n v = 100, 250, 500, 10000.] 
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n=100 



n=250 



n=500 



n=10000 




Figure 2: Distribution functions for ordinary and restricted likelihood-ratio statistic and 
asymptotic approximations. Upper row of panels: approximation of restricted likelihood- 
ratio (black) by the weighted sum of chi-squared distribution (red). Lower row of panels: 
approximation of ordinary likelihood-ratio (black) by Xp distribution (red). [A linear 
regression model with I = 100 predictors (for more details see caption of Figure [T]).] 



into two parts, do screening on one part and derive asymptotic p- values on the other part. 
A p-value computed based on a one-time single split de pends heavily on the arb itrary 



choice of the split and amounts to a "p-value lottery" (|Meinshausen et al. , 20091 ): for 



finite samples and for some specific split, the screening or the sparsity property might be 
violated which then results in a erroneous p -value. An alternat i ve to a single arbitrary 
sample split is to split the data repeatedly. Meinshausen et al. ( 20091 ) demonstrated in 
the context of variable selection in the high-dimensional regression model that such a 
multi-split approach gives improved and more reproducible results. We adopt this idea 
and divide the data repeatedly into k = 1,...,K different splits. Let (U-^V-JJ and 
(Uout) ^out) be the first and second half of split k. On the first half screening is performed 
by solving three times the ^-regularized log-likelihood problem (|2,3p . individually for each 
population and for both populations pooled together. The tuning parameter A is always 
chosen by cross-validation. This gives models M^ oint and M^ nd defined via active-sets 
X Acv (Uf n ), X Acv (VfJ and Z Acv ((Uf n , Vfj) . Then, the restricted log-likelihood ratio LR fc 
is evaluated on the second half of split k and a one-sided p-value is computed according 
to 



1 - f r (LR* 



where r = |2 Acv (Uf n )| + |J Acv (VfJ | + |X Acv ((Uf n ,VfJ) | and is defined in The- 



orem [2TT1 The wei ghts v are estimate d from the second half of split k as described in 
Section [23 



As in 



Meinshausen et al. ( 20091 ) we aggregate p- values V k ,k = 1,...,K, 



obtained from all different splits using the formula: 
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^agg = (1 - 7min) mf Oy ( {P b / T , b = 1, . . . , B}) , 

7G(7min,l) V ' 

where g 7 (-) is the empirical 7-quantile function. This procedure is summarized in Algo- 
rithm [TJ We refer to the choice K = 1 as the single-split method and to the choice K > 1 
as the multi-split method. 

Algorithm 1 Single- and multi-split algorithm for high-dimensional two-sample testing 
Input number of splits K. 
1: for k = 1,... ,K do 

2: Randomly split data into (Uf n , U* ut ) and (Vf n , V^ ut ). 

3: Screening on (Uf n , V^ n ) \Z\(-). defined in (|2.4p . A cv obtained by cross-validation]: 
Compute active-sets 7* ^ X Acv (Uf n ); 7* ^ X Acv (VfJ; /*„ <- Z Acv (Uf n , Vf n ). 

4: Significance Testing on data (Uj ut ,Vj ut ) 

Compute test-statistic: LR fc <- 2{L™ d ^(<^ u , - L J n™n„(<^)} 

Compute p- value: V k 4- 1 - * r (LR fe ; £) 
5: end for 

Output 

if K = 1: "P «- P 1 . 

if if > 1: Pagg <- (1 - 7min) mf g 7 ({7? fc /7; k = 1, • • • , K}) . 

7£(7min,l) 



3 Examples 

As mentioned in the beginning of Section [2] two examples of our approach are differential 
regression where the populations follow a linear regression model and differential network 
where the populations are generated from Gaussian graphical models. In this section we 
provide details on both examples. 

Differential regression Consider a regression model 

Y = Xp + e, (3.12) 

with Y C R (k = 1), X C R l (I = p - 1) and random error e ~ M(0,a 2 ). With 
4> = (/3, a 2 ) 6l p the score function is given by 

f(y-£x)x 1 ({y-£xf \\ 

= { — ^ — ^(^2 VJ- 
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In Appendix [B] we show that 



ft 



) T XX T (f3 c 



ft)) .(3.13) 



Now, given data U = (Y U ,X U ) and V = (Y„,X„) we obtain p-values by following Al- 
gorithm 1 (see Section [2.5p . For the regression model, li -penalize d maximum likeli hood 
estimation, used in the screening step, coi ncides with the Lasso ( Tibshirani . 19961 ) and 



is implemented in the R-package glmnet ( Friedman et al. . 2010l ). 



With the help of for- 
mula (|3.13p we can easily compute plug- in estimates Bj aIb p i U g_ in (see equation (|2.1ip ) 
and then the weights v of the asympto tic null distrib ution are obtained as outlined in 
Section 12.41 We use the algorithm of Dayies (Il980l ). implemented in the R-package 



CompQuadForm (IDuchesne and de Micheauxl . 
of VJ-,t>). 



20101 ) . to compute the distribution function 



Differential network Let Y C R k (I = 0) be G aussian distributed with zero mean 
and covariance E, i.e., JV(0, E). A Gaussian graphical model with undirected graph G is 
then defined by locations of zero entries in the inverse covariance matrix £1 = E" 1 , i.e., 
(jJ')^G O Vtjji = 0. Setting <j> = {Qjj>}j>j> G MP (p = (k+l)k/2) then the score function 
is given by 

S(jj0 (y^) = y(%^-%. 

By invoking formulas on fourth moments of a multivariate normal distribution (see Ap- 
pendix [B]) we find: 

^c[ s (j',i')(^' ^a) s (l,l')(Y; 4>b)] = ^c,jj'^c,ll' + EcjjEcj/j/ + £ Cj jj/E CJ -/j (3-14) 

In the Gaussian graphical model case ^-regularized maximum likelihood estimation is 
well-known unde r the name Graphical Lasso (or GLasso) and is implemented in the R- 
package glasso ( Friedman et al. . 20081 ) . As in differential regression, given data U = Y^ 



and V = Y v , plug-in estimates Bj a j b p i ug _i n are computed by formula (|3.14p . weights v are 
obtained subsequently as described in Section 12.41 and for p-value calculation we follow 
Algorithm 1. 



4 Numerical Results 
4.1 Simulations 

We consider the following simulation settings: 

Setting 1 Differential regression: Generate population U = (Y U ,X U ) and V = (Y V ,X V ) 
according to regression model (I3.12p . X u and X v are generated according to Af(0, E) 
with E jf = 0.5lJ— We set (n :=)n u = n v = 200 and I = 7,10,25,50,100,195 
(I = p — 1) and choose a u and a v such that the signal-to-noise ratio (SNR) equals 
10. 
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Under Ho, the regression coefficients (3 = j3 u = j3 v have 5 non-zero elements (at 
random location) with values 1. 

Under Ha, the regression coefficients (3 U and (3 V have entries of values 1 at three 
common locations and have entries of value a± at two different locations. a\ = 
0.25,0.5. 

Setting 2 As setting 1 but with SNR = 5. 

Setting 3 As setting 1 but as a predictor matrix X we use for both populations gene ex- 
pression data from n = 594 ovarian carcinomas. The data is publicly available at The 



Cancer Genome Atlas (TCGA) data portal (http://www.cancergenome.nih.gov) 



We select the I = 7, 10, 25, 50, 100, 250 genes exhibiting the highest empirical variance 
among samples. We choose a u and a v such that SNR=10. 

Setting 4 Differential network: Generate both population according to Y ~ A/"(0, E) 
with (n :=)n u = n v = 300 and k = 5,10,25,50,75. Note, that Y C R k and 
p= (fc + l)fc/2. 

Under Ho, the inverse covariance matrices £l u and Q v are equal and have k non-zero 
entries at random locations. 

Under Ha, the inverse covariance matrices Q u and £l v have \kat2~\ non-zero entries 
in common and k — \kat2~] non-zero's at different locations, «2 = 0.5, 0.8. 

The values a\ and a-i control the strength of the alternative for the regression and network 
cases respectively: A larger value a% results in two regression models which differ more. On 
the other hand a larger ct2 signifies that the two networks have more edges in common and 
are therefore more similar. For each of the four settings we perform 500 simulation runs for 
the null- and the different alternative hypothesis. We then compute for each run a p-value 
with the single- and the multi-split method (50 random splits, 7 m i n = 0.05). We compare 
the proposed methodology with p- values obtained using the asymptotic Xp-distribution of 
the ordinary likelihood-ratio statistic. We call this latter approach ordinary likelihood- 
ratio test. Further we compare with a permutation test where we use the symmetric 
Kullback-Leibler distance between population based ^-regularized estimates as a test 
statistic. This permutation test uses 100 random permutations and further details are 
given in Appendix For each setting we evaluate the fraction of wrongly rejected null 
hypothesis (false positive rate) and the fraction of correctly rejected hypothesis (true 
positive rate) at a significance level of 5%. 

Results are shown in Figures [3][U The false positive rate (FPR) of the ordinary likelihood- 
ratio explodes in all settings already for medium numbers of covariates I, respectively 
k. We further see that the single-split method is not able to control the true positive 
rate (TPR) at the 5% level in settings 1-2. This is most probably due to the fact that 
the sample size is too small and consequently the asymptotic approximation of the null 
distribution is not accurate enough. The multi-split method is much m ore conservative and 



has lo wer FPR. This observation is in agreement with results shown in lMeinshausen et al 
(j2009l ). As expected, the permutation test controls the false positive rate at the 5% level. 
Concerning the TPR (or power of the test) we find that single-splitting, multi-splitting 
and the permutation test perform well in most settings. Only with weak alternatives 
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(setting 1 and 2 with ol\ = 0.25) the TPR decreases. Interestingly, the permutation test 
exhibits loss of power for large values of I, whereas single- and multi-split display higher 
power than the permutation test for large / but lower power for small I. For completeness 
Figures [8lfTT1 in Appendix [D] provide more insights on the single-split algorithm. These 
figures report results for settings 1-4 with respect to the sparsity and screening properties. 

For setting 1 with p = 100 we further examine the distribution of the p-values obtained 
using the single-split method and the ordinary likelihood-ratio test when varying the 
sample size (n :=)n u = n v = 200, 500, 1000, 5000. For every n we use 100 samples in 
the screening step of the single-split method and the rest of the samples are used for p- 
value calculations. Results are shown in Figure El We see that p-values obtained from 
the single-split method are very well approximated by the uniform distribution function. 
Only in the scenario with n = 200 the approximation is not accurate enough which is also 
reflected in a too large number of false positives. 



data: HO data: HA, alpha"! =0.5 data: HA, alpha! =0.25 




i 1 1 1 1 r n 1 1 1 1 r n 1 1 1 1 r 

1=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 



Figure 3: Simulation study, Setting 1 (differential regression, SNR=10): False positive 
rate (FPR) and true positive rate (TPR) at 5%-level shown as a function of number of 
predictors (/). a\ indicates the strength of the alternative (the larger ot\ the stronger the 
alternative), ["ordinary lrt": classical likelihood ratio test; "single-split", "multi-split": 
see text; perm: permutation test as described in text]. 



4.2 Real Data 

We apply the single- and multi-split method to real datasets from cancer biology. For 
differential regression we take data from Broad-Novartis Cancer Cell Line Encyc lopaedia 
(CCLE) (jhttp : //www . broadinstitute . org/ccle/homep. Barretina et al. ( 20121 ) use the 



data to predict anticancer drug sensitivity from genomic data. They describe that the 
HDAC inhibitor panobinostat shows increased sensitivity in haematological cancers com- 
pared to solid cancers. We use the HDAC inhibitor panobinostat as a response variable and 
take gene-expressions of the I = 100 genes showing highest Pearson correlation over all sam- 
ples with panobinostat as the predictor matrix. The n u = 89 lung cancer cell lines form our 
population U and as population V we take the n v = 71 cell lines with haematopoietic and 
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data: HO 



data: HA, alpha1=0.5 



data: HA, alpha! =0.25 




Figure 4: Simulation study, Setting 2 (differential regression, SNR=5): False positive 
rate (FPR) and true positive rate (TPR) at 5%-level shown as a function of number of 
predictors (I). ot\ indicates the strength of the alternative (the larger ot\ the stronger the 
alternative), ["ordinary lrt": classical likelihood ratio test; "single-split", "multi-split": 
see text; perm: permutation test as described in text]. 




Figure 5: Simulation study, Setting 3 (differential regression, real predictor matrix): False 
positive rate (FPR) and true positive rate (TPR) at 5%-level shown as a function of 
number of predictors (I). ot\ indicates the strength of the alternative (the larger ot\ the 
stronger the alternative), ["ordinary lrt": classical likelihood ratio test; "single-split", 
"multi-split": see text; perm: permutation test as described in text]. 
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Figure 6: Simulation study, Setting 4 (differential network): False positive rate (FPR) 
and true positive rate (TPR) at 5%-level shown as a function of number of covariates (k). 
ct2 indicates the strength of the alternative (the smaller a\ the stronger the alternative), 
["ordinary lrt": classical likelihood ratio test; "single-split", "multi-split": see text; perm: 
permutation test as described in text]. 



n= 200 



n= 500 



n= 1000 



n= 5000 




I I I I I I I I I I I I I 

-0.1 0.2 0.5 0.8 1.1 




I I I I I I I I I I I I I 

-0.1 0.2 0.5 0.8 1.1 




I I I I I I I I I I I I I 

-0.1 0.2 0.5 0.8 1.1 




ordinary lrt 
single-split 



I I I I I I I I I I I I I 

0.1 0.2 0.5 0.8 1.1 




"i T 

ordinary lrt single-split 




"i r 

ordinary lrt single-split 



1 1 

ordinary lrt single-split 



1 1 

ordinary lrt single-split 



Figure 7: Distribution p- values, Setting 1, I = 100. Upper row of panels: empirical 
distribution function of p-values obtained by single-split method and ordinary likelihood- 
ratio test for different sample sizes (n :=)n u = n v = 200, 500, 1000, 5000. Lower row 
of panels: fraction of p-values which are smaller than 0.05 (type-I error at 5% level). 
[ "ordinary lrt" : classical likelihood ratio test; "single-split": see text]. 
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lymphoid-tissue origin. For differential network we use data from The Cancer Genome At- 



las (TCGA) data portal (http: //www. cancergenome . nih . govl) . We consider gene expres - 
sion data of the k = 41 genes present in the cancer gene list of Halm and Weinberg (2002) 
(downloadable at http://www.cbio.mskcc.org/CancerGenes) and compare n u = 594 
ovarian carcinomas against n v = 557 glioblastoma multiforme carcinomas. 

For both examples we compute p-values with the single-split method, the multi-split 
method and with the permutation test. We further carry out "back-testing" by divid- 
ing data randomly into populations U and V and then performing significance testing. 
All results are shown in Table [TJ In the regression example the single-split and multi-split 
method give considerably smaller p-values than the permutation test. A potential expla- 
nation could be that the permutation test exhibits small power in difficult scenarios (see 
settings 1 and 2 in Section 14.11 with ol\ = 0.25, i.e., right panel in Figures [3] and |4|) . In 
the differential network example all methods have p-values which are numerically indistin- 
guishable from zero. For the permutation test this implies that the observed test statistic 
is always larger than those obtained from random permutations. All p-values obtained for 
back-testing are large which is reassuring. 





single-split 


multi-split 


permutation 


Diffreg (lung vs. haematopoietic/lymphoid) 


7.7 x 10~ 5 


0.003758 


0.17 


Diffreg (back-testing) 


0.163936 


0.132444 


0.5 


Diffnet (ovarian vs. glioblastoma) 


< 10~ 6 


< 10~ 6 





Diffnet (back-testing) 


0.924996 


1 


1 



Table 1: Real data examples. P-values obtained on Cancer Cell Line Encyclopedia (CCLE) 
and The Cancer Genome Atlas (TCGA). Differential regression (Diffreg): Anti-cancer 
drug panobinostat regressed on gene expression data; lung cancer (n u = 89) against 
haematopoietic/lymphoid cancer (n v = 71). Differential network (Diffnet): Gene expres- 
sions from ovarian cancer patients (n u = 594) against gene expressions from glioblastoma 
multiforme cancer patients (n v = 557). 



5 Discussion and conclusions 



We have presented a novel and very general approach for high-dimensional two-sample 
testing. We combined ideas including sample-splitting, non-nested hypothesis testing and 
p-value aggregation to propose a methodology that allows two-sample testing for a wide 
range of high-dimensional models. We treated in detail linear regression (differential re- 
gression) and Gaussian graphical models (differential network) and validated their per- 
formance on simulated and real data. 

Our methodology is supported by asymptotic theory. However, our results obtained in 
Proposition 12.11 and Theorem 12.11 do not reflect a truly high-dimensional setup as they 
consider the active sets selected in the screening step as fixed. An aim of our ongoing 
research is to generalize these results to the case where the size of the active sets can grow 
with a rate of smaller order than the sample size. 
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Whilst the focus of this paper was not on applications, we note that the methodology we 
propose should have immediate utility in biology. Differential network can be used to test 
a number of hypotheses of current scientific interest. For example, in cancer biology it is 
widely believed that genomic differences between cancer types and subtypes may in some 
cases be manifested also at the level of biological networks, including gene regulatory and 
protein signaling networks. However, we are not aware of existing methodology that allows 
such hypotheses to be tested statistically under the relevant conditions of moderate sample 
size and high dimensionality. Our approach can be used to test such hypotheses directly 
from high-throughput data, as illustrated in the example above. A further application 
of the differential n etworks formu l ation of our approach is in gene-set testing. Currently, 
gene set tests (e.g., Irizarry et al. ( 20091 )) are not truly multivariate. Our approach could 
be directly applied to test differences in gene sets at the level of not only means but also 
covariances or networks. 
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A Proofs of Section [2] 

Without loss of generality, we take (n :=) = n u = n v . We assume that the following 
assumptions (Al)-(A6) hold: 
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(Al) Data U = (Y u , X u ) and V = (Y„,X„) are i.i.d. samples with X U ,X V drawn from 
some common density function and I^JA^ = x ~ d(-\x; (ft u ), Y V \X V = x ~ d(-\x; 4> v ). 
The conditional density function d{y\x; <j)) is strictly positive for almost all (y, x) and 
all <p. 

(A2) $ u , <S> V and $ uv are compact subsets of MP, and the conditional density function 
d(y\x](f)) is continuous in (p. 

(A3) For almost all (y,x), | log d(y\x; -)| is dominated by an integrable function indepen- 
dent of 0. The functions 

<f> u ^ E[E^[logd(F u |X u ;0 u )]] 
^ E[E^[logd(F„|^;^)]] 
^ ^ E[E^[logd(F u |X u ;0 w )]]+E[E^[logd(y u |X v ;^)]] 

have unique maximums on and at </>*, 0* and 

(A4) For almost all (y, x): log d(y\x; •) is twice continuously differentiable on <I>; |<9 log d(y\x; 
d log d(y\x; -)/d(p'\ and |<91og<i(y|x; -)/d(j)d(j)'\ are dominated by integrable functions 
independent on (p. 

(A5) If (0 : — )(/> u — </>t,, then is an interior point of <$> v and Further, is a 

regular point of B M - md and -B Mjoint • 

(A6) The information matrix equivalence holds, i.e., for all A C {1, . . . ,p} we have: 



E[E^[d log d{y\x- 4>)/d<l>A ■ 9 log d{y\x- 4>)/dfo 



-E[E^ 2 logd(y|^)/d0i]]. 



Proof of Proposition 12.11 Given assumptions (Al)-(A3) and noting that 0* 
(f>* = 4> v (a consequence of the screening property) we have 



E 



(4>u,<t>v) 



log 



d(Y u \X u \4> u )d(Y v \X v ;4>v) 
d(Y u \X u ;<f>* uv )d(Y v \X v ;<f>* uv ) 



D 



+ D 



(A-15) 



where D(^||<^') = E E^ log ^y\x'p) * s ^ e standard Kullback-Leibler divergence. If 
0m 7^ 4>v, then at least one term on the right hand side of equation (|A.15|) is strictly 
positive. Therefore, under Ha: 



LR n ,n 



oo n — > oo . 



□ 



Proof of Theorem 12.11 Consider the setting in Vuong ( 19891 ) with competing models 



Mi n< j and G~ 



M 



joint; 1-6. 



f(y\x;9) = d(y u \x u ;4> u )d(y v \x v ]<j) v ) and g(y\x;j) = d(y u \x u ; 4> uv )d(y v \x v ; <p uv ) 

with y = (y u ,y v ), x = (x u ,x v ), 6 = {cj) u ,4> v ) and 7 = <j> uv . If the screening property 
holds then model M- m & is correctly specified and the pseudo-true values ((/>*,(/>*) equal the 
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true values (4> u ,4>v)- Furthermore, if we have (0 :=)<j> u = 4>v then the screening property 
guarantees that also Mj i n t is correctly specified and that §* uv = 4>. As a consequence we 
have f(y\x;6*) = q ( y \ x ; 7* ) . Th erefore , assuming assumptions (A1)-(A5), it follows from 
Theorem 3.3 (i) iiilViiomil l|l98flh that 



LRn,n 4 ^r(s^), 

where r = |/ u | + + \I UV \ and the weights Uj, j = 1, . . . ,r, are eigenvalues of the matrix 

117 = A- A- J ■ (A16) 



The matrices Bf, B g , Bf g , B g j, Af and A g are defined as in IVuonel (|l989h . If we set 



B Mnd = B fi B M joint = B g and % joint tf tad = 
then we obtain as a consequence of the independence of the populations U and V: 



( B u \ 
B M ind = ( 0" w J > B AW = 5 L + 5 L 



and 

Bm- ■ i.m- j = (Bf j , BY r ) . 
By invoking the information matrix equivalence, assumption (A6), we find 



W 



Zr u +r v B M ind M ioirA B M . 3oixi 
B M io - mt M ind B M ind -Zr uv 



□ 



Proof of Proposition 12.21 Set r u = \I U \, r v = \I V \, r uv = \I UV \ and r = r u + r v +r uv . The 
eigenvalues of W are the solutions to the equation 



det(W - vX) = det [ [ D ~ U x BM ^^ B M) oiBt \ \ = 

\\ B M jotnt M iBd B M ind - X r uv -V J J 

Assume \I U \ + \I V \ > \I UV \. 

If v 7^ 1, then we find by using Schur's complement 
det(W - !/2} = det(^„ - A)det(J3^ 

where (J, = (1 — v 2 ). The second term on the right of equation ()A.17p has r uv roots. 
Therefore r u + r v — r uv of the eigenvalues Uj equal one. As 0* = <pl = 4>* uv , we can write 



B tiu V = Hiuiuvi&'iKo) and B tiuv = S H/„„(C; 



uv) 
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Solving 

det(S Mjoint M tad ^M ind - B Af tad M joint 5M. oint - (J2r uv ) = 

is equivalent to 

det ((Bf j (By ^Bf j -vBf ) + (B] T (Bf )- x B°j 7 - fiB] )) = 0. (A.18) 

\ v Juv J-u v J-u ' J-u^uv ' 1 UV ' v 1 UV 1 V ^ J-v y J-v-Luv ' J-uv ' / v 7 

An easy calculation involving Schur's complement shows 

(a?.)- 1 

Further, we find 



(Bj) +(BjT Bj} (QJ jiB-j)-' -(BJ)^B- } (Q) )" 

J 1i -'lit' J u 



iuviu\ iu> luiuv yB™ Q*i ■ (Q l o ■ + -B" (B U )~ 1 B U - 



Putting together, equation (|A.18P reads 



= det ,+2* J * (A ' 19) 

\ JuvJ luvJ / 

with 

* = fe f (Q u 1 T 1 Q] 1 +Q] jiQV'Qh 

+ B ] j(B u jT 1 B u j} +B] /Bfi- 1 ^ -KB] +B] ). 

-* uv •J <J 1 UV 1 UV 'J <J Juv Juv Juv 

From (fA~T9l) we deduce 

= (l-/i)^(*-(l- M )(5jf J + J B| J )( J B} + J B})- 1 (5« i + B° t )) 

\ J-uvJ J-uvJ <-> J-uv <J Juv / 

= (1 - M r- det ((Qjf i (Q])- 1 Q u u +Q) i {Q v i )~ l Q) i )(Q) + Q) J" 1 - 

\ J-uvJ-u J-u J uJ-uv J-uv 1 v J-v - L v- L uv Juv Juv / 

and conclude that 2 r u „ of the eigenvalues are zero and that the remaining eigenvalues are 
obtained by computing eigenvalues of 

{Qi iQfQa +Qi iQfQii )(2Qi 

The proof for the case with \l uv \ > \I U \ + \I V \ works similarly. □ 
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B Derivation of equations (13.131) and ( 13.141) in Section [3] 



Equation rt37L3j) Set e c = Y - X T /3 C and note that E 0c [e c ] = 0, E^Je 2 ] = a 2 c . We then 
find: 



E 0c ls{Y\X;<j> a )s(Y\X;< 



XX T 
XX T 
XX T 



{Y-X T p a )X\ f (Y-X T (3 b )X \ 
\ -I ) 



T 



E^ [(e c + X T (/3 C - /3 a ))(e c + X T (/3 C - ft))] 

E 0c [e 2 + e c X T (/3 C - (&+&)) + (& - /? a ) T XX T (ft - A)] 



(a c 2 + (/3 c -/3 a ) T XX T (/3 c -/3 & )). 



Equation (|3. 14j) Invoking formulas on fourth moments of a multivariate Normal dis- 
tribution we find: 

E^ c [s {jj/) (Y^ a )s m (Y^ b )] = E^ [(yO-)y(j') _ £ jy )(y(Dy(0 - E„,) 

= E^[yWy^)y(Oyao_y&0yyo E ^_y(Oy(r) E ^ +EaJi#Ewi> 

— Ec-yS&H' - E c wE a .,v/ + E ,yE& w. 



C Permutation test and symmetric Kullback-Leibler diver- 
gence 



The symmetric Kullback Leibler divergence is defined as 



D 



symm 



(^W) = D(0||^) + D(^| 



Consider a random partitition of the data into two groups U = (Y^,X^) and V 
(Y^X^). We construct a permutation test based on the test statistic 



Dsymm (0«,Acvll^,Aov): 



with 



6 fi>A = argmax^; Y s |Xu) -A||0||i, 

06$ 



>A = argmax£(0; Y^|X e ) - A||0||i 
</>e* 



and tuning parameter chosen by cross-validation (A = A cv ). 

For the regression model the Kullback-Leibler divergence equals: 



D((/3,a 2 )||(/3V 2 )) 



2 + 2cj' 2 



+ (/3 - P>) T E[xx T ]((3 - (3') + £ lo { 



cr 



/2 



v2 ' 
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For the Gaussian graphical model we have: 

Dm - ^(n-n-)- log (££)-*). 

D Additional Figures of Section |4] 



data: HO data: HO data: HA, alpha! =0.5 data: HA, alphal =0.25 




1=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 



Figure 8: Simulation study, Setting 1 (differential regression, SNR=10): Sparsity and 
screening property of single-split method over 500 simulation runs. Left panel shows 
the average number of non-zero parameter components of the joint model selected in the 
screening step (i.e., \I U v\) ■ The second to the fourth panels show frequencies which indicate 
how many times the screening property is satisfied. 



data: H0 data: HO D data: HA, alphal =0.5 data: HA, alphal =0.25 




1=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 l=7 1=10 l=25 l=50 1=100 1=195 



Figure 9: Simulation study, Setting 2 (differential regression, SNR=5): Same notation as 
in Figure El 
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data: HO data: HO o data: HA, alpha1=0.5 o data: HA, alpha"! =0.25 




l=7 1=10 [=25 l=50 1=100 1=250 l=7 1=10 l=25 l=50 1=100 1=250 1=7 1=10 l=25 l=50 1=100 1=250 1=7 1=10 l=25 l=50 1=100 !=250 



Figure 10: Simulation study, Setting 3 (differential regression, real predictor matrix): 
Same notation as in Figure El 



data: H0 o data: HO o data: HA, alpha2=0.5 o data: HA, alpha2=0.8 




k=10 k=25 k=50 k=75 k=10 k=25 k=50 k=75 k=10 k=25 k=50 k=75 k=10 k=25 k=50 k=75 



Figure 11: Simulation study, Setting 4 (differential network): Same notation as in Figure [8l 



26 



